57 research outputs found
Understanding High Dimensional Spaces through Visual Means Employing Multidimensional Projections
Data visualisation helps in understanding data represented by multiple
variables, also called features, stored in a large matrix where individuals are
stored in rows and variable values in columns. These data structures are
frequently called multidimensional spaces. In this paper, we illustrate ways of
employing the visual results of multidimensional projection algorithms to
understand and fine-tune the parameters of their mathematical framework. Some
of the mathematical constructs common to these approaches are Laplacian matrices,
Euclidean distance, cosine distance, and statistical methods such as
Kullback-Leibler divergence, employed to fit probability distributions and
reduce dimensions. Two of the relevant algorithms in the data visualisation
field are t-distributed stochastic neighbour embedding (t-SNE) and
Least-Square Projection (LSP). These algorithms can be used to understand
a range of mathematical functions, including their impact on datasets. In
this article, mathematical parameters of underlying techniques such as
Principal Component Analysis (PCA) behind t-SNE and mesh reconstruction methods
behind LSP are adjusted to reflect the properties afforded by the mathematical
formulation. The results, supported by illustrations of the LSP and t-SNE
processes, are meant to inspire students to understand the mathematics
behind such methods and to apply them in effective data analysis tasks across
multiple applications.
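As a side note for readers, the Kullback-Leibler divergence mentioned in the abstract — the quantity t-SNE minimises between high- and low-dimensional neighbourhood distributions — can be sketched in a few lines. This is a minimal illustration, not the paper's code; the function name and smoothing constant are our own assumptions.

```python
import numpy as np

def kl_divergence(p, q, eps=1e-12):
    # KL(p || q) between two discrete distributions; eps avoids log(0).
    p = np.asarray(p, dtype=float) + eps
    q = np.asarray(q, dtype=float) + eps
    p /= p.sum()
    q /= q.sum()
    return float(np.sum(p * np.log(p / q)))
```

t-SNE evaluates this divergence with `p` built from pairwise affinities in the original space and `q` from affinities in the projection, then moves projected points to reduce it.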
LDPP at the FinNLP-2022 ERAI task: Determinantal point processes and variational auto-encoders for identifying high-quality opinions from a pool of social media posts
Social media and online forums have made it easier for people to share their views and opinions on various topics in society. In this paper, we focus on posts discussing investment-related topics. When it comes to investment, people can now easily share their opinions about online traded items on social media and provide rationales to support their arguments. However, there are millions of posts to read, potentially including posts from amateur investors or completely unrelated posts. Identifying the most important posts that could lead to a higher maximal potential profit (MPP) and a lower maximal loss for investment is not a trivial task. In this paper, we propose to use determinantal point processes and variational autoencoders to identify high-quality posts from the given rationales. Experimental results suggest that our method mines higher-quality posts than random selection, and that latent variable modelling improves the quality of the selected posts.
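For intuition, a determinantal point process favours subsets that are both high-quality and diverse, because the determinant of a similarity kernel shrinks when selected items are redundant. The sketch below shows greedy MAP inference over such a kernel; the kernel construction and function name are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def greedy_dpp(L, k):
    # Greedy MAP inference for a DPP with kernel L: repeatedly add the
    # item that most increases the log-determinant of the selected
    # submatrix, which rewards quality while penalising redundancy.
    n = L.shape[0]
    selected = []
    for _ in range(k):
        best, best_gain = None, -np.inf
        for i in range(n):
            if i in selected:
                continue
            idx = selected + [i]
            sign, logdet = np.linalg.slogdet(L[np.ix_(idx, idx)])
            if sign > 0 and logdet > best_gain:
                best, best_gain = i, logdet
        if best is None:
            break
        selected.append(best)
    return selected
```

With a kernel `L = V @ V.T` built from post embeddings `V`, near-duplicate posts make the submatrix nearly singular, so the greedy step skips them in favour of a dissimilar post.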
The role of habitat features in a primary succession
In order to determine the role of habitat features in a primary succession on lava domes of Terceira Island (Azores), we addressed the following questions: (1) Is the rate of cover development related to environmental stress? (2) Do succession rates differ as a result of habitat differences? One transect, intercepting several habitat types (rocky hummocks,
hollows and pits, small and large fissures), was established from the slope to the summit of a 247 yr old dome. Data on floristic composition, vegetation bioarea, structure, demography and soil nutrients were collected. Quantitative and qualitative similarities among habitats were also analyzed. Cover development and species accumulation are mainly dependent on
habitat features. Habitat features play a critical role in determining the rate of succession by providing different environmental conditions that enable different rates of colonization and
cover development. Since the slope’s surface is composed of hummocks, hollows and pits,
the low succession rates in these habitats are responsible for the lower rates of succession in this geomorphologic unit, whereas the presence of fissures on the dome’s summit accelerates its succession rate.
Visual analysis of interactive document clustering streams
Interactive clustering techniques play a key role by putting the user in the clustering loop, allowing her to interact with document group abstractions instead of full-length documents. This lets users focus on corpus exploration as an incremental task. To explore the incremental aspect of information discovery, this article proposes a visual component to depict clustering membership changes throughout a clustering iteration loop in both static and dynamic data sets. The visual component is evaluated with an expert user and in an experiment with data streams.
GGNN@Causal News Corpus 2022: Gated graph neural networks for causal event classification from social-political news articles
The discovery of causality mentions in text is a core cognitive concept and appears in many natural language processing (NLP) applications. In this paper, we study the task of Event Causality Identification (ECI) from social-political news. The aim of the task is to detect causal relationships between event mention pairs in text. Although deep learning models have recently achieved state-of-the-art performance on many NLP tasks and applications, most of them still fail to capture the rich semantic and syntactic structures within sentences, which are key for causality classification. We present a solution for causal event detection from social-political news that captures semantic and syntactic information based on gated graph neural networks (GGNN) and contextualized language embeddings. Experimental results show that our proposed method outperforms the baseline model, BERT (Bidirectional Encoder Representations from Transformers), in terms of F1-score and accuracy.
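For readers unfamiliar with gated graph neural networks, a single propagation step combines message passing over the sentence graph with GRU-style gating. The numpy sketch below is a generic illustration of that update, under our own naming and weight conventions, not the authors' architecture.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def ggnn_step(h, A, Wz, Wr, Wh, Uz, Ur, Uh):
    # One gated propagation step: each node aggregates neighbour
    # states via the adjacency matrix A, then updates its state with
    # GRU-style update (z) and reset (r) gates.
    m = A @ h                        # message passing over the graph
    z = sigmoid(m @ Wz + h @ Uz)     # update gate
    r = sigmoid(m @ Wr + h @ Ur)     # reset gate
    h_tilde = np.tanh(m @ Wh + (r * h) @ Uh)
    return (1 - z) * h + z * h_tilde
```

Stacking several such steps lets information from syntactic neighbours (e.g. dependency-parse edges) flow into each event mention's representation before classification.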
Explaining neighborhood preservation for multidimensional projections
Dimensionality reduction techniques are the tools of choice for exploring high-dimensional datasets by means of low-dimensional projections. However, even state-of-the-art projection methods fail, to various degrees, to perfectly preserve the structure of the data, expressed in terms of inter-point distances and point neighborhoods. To support better interpretation of a projection, we propose several metrics for quantifying errors related to neighborhood preservation. Next, we propose a number of visualizations that allow users to explore and explain the quality of neighborhood preservation at different scales, captured by the aforementioned error metrics. We demonstrate our exploratory views on three real-world datasets and two state-of-the-art multidimensional projection techniques. São Paulo Research Foundation (FAPESP) (grant 2012/07722-9); CAPES–NUFFIC 028/1
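One simple metric in the spirit of this abstract scores how well a projection preserves each point's k-nearest-neighbour set. This is an illustrative sketch of such a neighbourhood-preservation score, not one of the paper's proposed metrics.

```python
import numpy as np

def knn_preservation(X_high, X_low, k=5):
    # Average fraction of each point's k nearest neighbours in the
    # high-dimensional space that remain among its k nearest
    # neighbours in the low-dimensional projection (1.0 = perfect).
    def knn(X):
        D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
        np.fill_diagonal(D, np.inf)   # a point is not its own neighbour
        return np.argsort(D, axis=1)[:, :k]
    hi = knn(np.asarray(X_high, dtype=float))
    lo = knn(np.asarray(X_low, dtype=float))
    return float(np.mean([len(set(hi[i]) & set(lo[i])) / k
                          for i in range(len(hi))]))
```

Per-point scores (the list inside the mean) could also be mapped to colour in a projection view to localise where neighbourhoods break.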
UCCNLP@SMM4H’22:Label distribution aware long-tailed learning with post-hoc posterior calibration applied to text classification
The paper describes our submissions to the Social Media Mining for Health (SMM4H) workshop 2022 shared tasks. We participated in two tasks: (1) classification of adverse drug event (ADE) mentions in English tweets (Task 1a) and (2) classification of self-reported intimate partner violence (IPV) on Twitter (Task 7). We proposed an approach that uses RoBERTa (A Robustly Optimized BERT Pretraining Approach) fine-tuned with a label-distribution-aware margin loss function and post-hoc posterior calibration for robust inference against class imbalance. We achieved a 4% and a 1% increase in performance on IPV and ADE, respectively, compared with the traditional fine-tuning strategy with unweighted cross-entropy loss.
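The core idea of a label-distribution-aware margin loss is that rarer classes receive larger margins, typically scaling as n_j^{-1/4}, subtracted from the true-class logit before the cross-entropy. The sketch below illustrates that mechanism in numpy; the function names, the constant C, and the plain cross-entropy are our own simplifications, not the authors' training code.

```python
import numpy as np

def ldam_logits(logits, labels, class_counts, C=1.0):
    # Label-distribution-aware margins: class j gets margin
    # C / n_j**0.25, so rare classes must be predicted with a larger
    # score gap. The margin is subtracted from the true-class logit.
    margins = C / np.asarray(class_counts, dtype=float) ** 0.25
    out = logits.astype(float).copy()
    out[np.arange(len(labels)), labels] -= margins[labels]
    return out

def cross_entropy(logits, labels):
    # Mean negative log-likelihood with a stable log-softmax.
    z = logits - logits.max(axis=1, keepdims=True)
    logp = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    return float(-logp[np.arange(len(labels)), labels].mean())
```

Training minimises `cross_entropy(ldam_logits(...), labels)`; at inference, post-hoc calibration would instead adjust the raw logits by the class prior.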
UNLPSat TextGraphs-16 Natural Language Premise Selection task: Unsupervised Natural Language Premise Selection in mathematical text using sentence-MPNet
This paper describes our system submitted to the TextGraphs 2022 shared task at COLING 2022: Natural Language Premise Selection (NLPS) from mathematical texts. The task of NLPS is to select mathematical statements, called premises, from a knowledge base written in natural language and mathematical formulae, that are most likely to be used in a particular mathematical proof. We formulated this task as an unsupervised semantic similarity task by first obtaining contextualized embeddings of both the premises and the mathematical proofs using sentence transformers. We then computed the cosine similarity between the embeddings of premises and proofs and selected the premises with the highest cosine scores as the most probable. Our system improves over the baseline system, which uses bag-of-words models based on term frequency–inverse document frequency (TF-IDF), in terms of mean average precision (MAP) by about 23.5% (0.1516 versus 0.1228).
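The ranking step this abstract describes — cosine similarity between a proof embedding and each premise embedding, keeping the top scores — can be sketched as follows. The function name and return format are our own; in the actual system the vectors would come from a sentence-transformer encoder.

```python
import numpy as np

def rank_premises(proof_vec, premise_vecs, top_k=3):
    # Cosine similarity between one proof embedding and each premise
    # embedding; returns premise indices sorted by decreasing score.
    P = np.asarray(premise_vecs, dtype=float)
    q = np.asarray(proof_vec, dtype=float)
    sims = (P @ q) / (np.linalg.norm(P, axis=1) * np.linalg.norm(q) + 1e-12)
    order = np.argsort(-sims)[:top_k]
    return order.tolist(), sims[order].tolist()
```

Because the method needs no labelled premise-proof pairs, it stays fully unsupervised: ranking quality depends entirely on the pretrained embeddings.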